Abstract: We propose two robust methods for anomaly 
detection in dynamic networks in which the properties of 
normal traffic are time-varying. We formulate the robust 
anomaly detection problem as a binary composite hypothesis 
testing problem and propose two methods: a model-free and 
a model-based one, leveraging techniques from the theory of 
large deviations. Both methods require a family of Probability 
Laws (PLs) that represent normal properties of traffic. We 
devise a two-step procedure to estimate this family of PLs. 
We compare the performance of our robust methods and 
their vanilla counterparts, which assume that normal traffic is 
stationary, on a network with a diurnal normal pattern and a 
common anomaly related to data exfiltration. Simulation results 
show that our robust methods perform better than their vanilla 
counterparts in dynamic networks. 

Index Terms: Robust statistical anomaly detection, large 
deviations theory, set covering, binary composite hypothesis 
testing. 

I. Introduction 

A network anomaly is any potentially malicious traffic 
sequence that has implications for the security of the network. 
Although automated online traffic anomaly detection 
has received a lot of attention, this field is far from mature. 

Network anomaly detection belongs to a broader field 
of system anomaly detection whose approaches can be 
roughly grouped into two classes: signature-based anomaly 
detection , where known patterns of past anomalies are used 
to identify ongoing anomalies [1], [2], and change-based 
anomaly detection that identifies patterns that substantially 
deviate from normal patterns of operations [3], [4], [5]. [6] 
showed that the detection rates of systems based on pattern 
matching are below 70%. Furthermore, such systems cannot 
detect zero-day attacks, i.e., attacks not previously seen, and 
need constant (and expensive) updating to keep up with 
new attack signatures. In contrast, change-based anomaly 
detection methods are considered to be more economical and 
promising since they can identify novel attacks. In this work 
we focus on change-based anomaly detection methods, in 
particular on statistical anomaly detection that leverages 
statistical methods. 

Standard statistical anomaly detection consists of two 
steps. The first step is to learn the “normal behavior” by 
analyzing past system behavior; usually a segment of records 
corresponding to normal system activity. The second step is 
to identify time instances where system behavior does not 
appear to be normal by monitoring the system continuously. 

For anomaly detection in networks, [5] presents two methods 
to characterize normal behavior and to assess deviations 
from it based on Large Deviations Theory (LDT) [7]. 
Both methods consider the traffic, which is a sequence of 
flows, as a sample path of an underlying stochastic process 
and compare current network traffic to some reference network 
traffic using LDT. One method, which is referred to 
as the model-free method, employs the method of types [7] 
to characterize the type (i.e., empirical measure) of an 
independent and identically distributed (i.i.d.) sequence of 
network flows. The other method, which is referred to as the 
model-based method, models traffic as a Markov Modulated 
Process. Both methods rely on a stationarity assumption 
postulating that the properties of normal traffic in networks 
do not change over time. 

However, the stationarity assumption is rarely satisfied in 
contemporary networks [8]. For example, Internet traffic is 
subject to weekly and diurnal variations [9], [10]. Internet 
traffic is also influenced by macroscopic factors such as 
important holidays and events [11]. Similar phenomena arise 
in local area networks as well. We will call a network 
dynamic if its traffic exhibits time-varying behavior. 

The challenges for anomaly detection in dynamic networks 
are two-fold. First, the methods used for learning the 
“normal behavior” are usually quite sensitive to the presence 
of non-stationarity. Second, the modeling and prediction of 
multi-dimensional and time-dependent behavior is hard. 

To address these challenges, we generalize the vanilla 
model-free and model-based methods from [5] and develop 
what we call the robust model-free and the robust model-based 
methods. The novelties of our new methods are as 
follows. First, our methods are robust and optimal in the 
generalized Neyman-Pearson sense. Second, we propose a 
two-stage method to estimate Probability Laws (PLs) that 
characterize normal system behaviors. Our two-stage method 
transforms a hard problem (i.e., estimating PLs for 
multi-dimensional data) into two well-studied problems: (i) 
estimating one-dimensional data parameters and (ii) the set 
cover problem. Being concise and interpretable, our estimated 
PLs are helpful not only in anomaly detection but 
also in understanding normal system behavior. 

The structure of the paper is as follows. Sec. II formulates 
system anomaly detection as a binary composite hypothesis 
testing problem and proposes two robust methods. Sec. III 
applies the methods presented in Sec. II to network anomaly 
detection. Sec. IV explains the simulation setup and presents 
results from our robust methods as well as their vanilla 
counterparts. Finally, Sec. V provides concluding remarks. 

II. Binary composite hypothesis testing 

We model the network environment as a stochastic process 
and estimate its parameters through some reference traffic 
(viewed as sample paths). Then the problem of network 
anomaly detection is equivalent to testing whether a sequence 
of observations $\mathcal{G} = \{g^1, \dots, g^n\}$ is a sample path of a 
discrete-time stochastic process $\mathscr{G} = \{G^1, \dots, G^n\}$ 
(hypothesis $\mathcal{H}_0$). All random variables $G^i$ are discrete and their 
sample space is a finite alphabet $\Sigma = \{\sigma_1, \sigma_2, \dots, \sigma_{|\Sigma|}\}$, 
where $|\Sigma|$ denotes the cardinality of $\Sigma$. All observed symbols 
$g^i$ belong to $\Sigma$, too. This problem is a binary composite 
hypothesis testing problem. Because the joint distribution of 
all random variables $G^i$ in $\mathscr{G}$ becomes complex when $n$ is 
large, we propose two types of simplification. 

A. A model-free method 

We propose a model-free method that assumes the random 
variables $G^i$ are i.i.d. Each $G^i$ takes the value $\sigma_j$ with 
probability $p^F_\theta(G^i = \sigma_j)$, $j = 1, \dots, |\Sigma|$, which is 
parameterized by $\theta \in \Omega$. We refer to the vector 
$\mathbf{p}^F_\theta = (p^F_\theta(G^i = \sigma_1), \dots, p^F_\theta(G^i = \sigma_{|\Sigma|}))$ 
as the model-free Probability Law (PL) associated with $\theta$. 
Then the family of model-free PLs 
$\mathscr{P}^F = \{\mathbf{p}^F_\theta : \theta \in \Omega\}$ characterizes the stochastic process $\mathscr{G}$. 

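To characterize the observation $\mathcal{G}$, let 

$$\mathcal{E}^{\mathcal{G}}_F(\sigma_j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(g^i = \sigma_j), \qquad j = 1, \dots, |\Sigma|, \qquad (1)$$

where $\mathbf{1}(\cdot)$ is an indicator function. Then, an estimate for 
the underlying model-free PL based on the observation $\mathcal{G}$ 
is $\mathcal{E}^{\mathcal{G}}_F = \{\mathcal{E}^{\mathcal{G}}_F(\sigma_j) : j = 1, \dots, |\Sigma|\}$, which is called the 
model-free empirical measure of $\mathcal{G}$. 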
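
As a small illustration of Eq. (1), the following Python sketch computes a model-free empirical measure. The alphabet and the sample window are hypothetical values chosen only for this example; they are not data from the paper.

```python
from collections import Counter

def model_free_empirical_measure(observations, alphabet):
    """Fraction of observations equal to each symbol of the alphabet (Eq. (1))."""
    n = len(observations)
    counts = Counter(observations)
    return [counts[symbol] / n for symbol in alphabet]

alphabet = ["a", "b", "c"]        # hypothetical alphabet
window = ["a", "b", "a", "a"]     # hypothetical quantized flows
print(model_free_empirical_measure(window, alphabet))
```

Symbols that never occur in the window receive probability zero, which is why the divergence defined next floors its arguments at a small positive constant.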

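Suppose $\mu = (\mu(\sigma_1), \dots, \mu(\sigma_{|\Sigma|}))$ is a model-free PL and 
$\nu = (\nu(\sigma_1), \dots, \nu(\sigma_{|\Sigma|}))$ is a model-free empirical measure. 
To quantify the difference between $\mu$ and $\nu$, we define the 
model-free divergence between $\mu$ and $\nu$ as 

$$D_F(\nu \| \mu) = \sum_{j=1}^{|\Sigma|} \hat{\nu}(\sigma_j) \log \frac{\hat{\nu}(\sigma_j)}{\hat{\mu}(\sigma_j)}, \qquad (2)$$

where $\hat{\nu}(\sigma_j) = \max(\nu(\sigma_j), \varepsilon)$ and 
$\hat{\mu}(\sigma_j) = \max(\mu(\sigma_j), \varepsilon)$ for all $j$, and $\varepsilon$ is a small positive constant 
introduced to avoid underflow and division by zero. 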
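
A minimal Python sketch of the divergence in Eq. (2) is given below. The default value of `eps` is an assumption made for the example; the text only requires a small positive constant.

```python
import math

def model_free_divergence(nu, mu, eps=1e-10):
    """Divergence of Eq. (2) between an empirical measure nu and a PL mu.

    Both arguments are sequences of probabilities over the same alphabet.
    eps is the small positive floor; its default value is an assumption.
    """
    total = 0.0
    for nu_j, mu_j in zip(nu, mu):
        nu_hat = max(nu_j, eps)
        mu_hat = max(mu_j, eps)
        total += nu_hat * math.log(nu_hat / mu_hat)
    return total

# Hypothetical example: a window in which only the first symbol was seen,
# compared against a uniform PL on two symbols.
print(model_free_divergence([1.0, 0.0], [0.5, 0.5]))
```

The divergence is zero when the empirical measure coincides with the PL and grows as the two move apart.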


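Definition 1 

(Model-Free Generalized Hoeffding Test). The model-free 
generalized Hoeffding test [12] is to reject $\mathcal{H}_0$ if $\mathcal{G}$ is in 

$$S^*_F = \{\mathcal{G} \mid \inf_{\theta \in \Omega} D_F(\mathcal{E}^{\mathcal{G}}_F \| \mathbf{p}^F_\theta) > \lambda\},$$

where $\lambda$ is a detection threshold and $\inf_{\theta \in \Omega} D_F(\mathcal{E}^{\mathcal{G}}_F \| \mathbf{p}^F_\theta)$ is 
referred to as the generalized model-free divergence between 
$\mathcal{E}^{\mathcal{G}}_F$ and $\mathscr{P}^F = \{\mathbf{p}^F_\theta : \theta \in \Omega\}$. 

A similar definition has been proposed for robust localization 
in sensor networks [13]. One can show that this generalized 
Hoeffding test is asymptotically (as $n \to \infty$) optimal in a 
generalized Neyman-Pearson sense; we omit the technical 
details in the interest of space. 

B. A model-based method 

We now turn to the model-based method where the random 
process $\mathscr{G} = \{G^1, \dots, G^n\}$ is assumed to be a Markov 
chain. Under this assumption, the joint distribution of $\mathscr{G}$ 
becomes $p_\theta(\mathscr{G} = \mathcal{G}) = p^B_\theta(g^1) \prod_{i=1}^{n-1} p^B_\theta(g^{i+1} \mid g^i)$, where 
$p^B_\theta(\cdot)$ is the initial distribution and $p^B_\theta(\cdot \mid \cdot)$ is the transition 
probability; all parametrized by $\theta \in \Omega$. 

Let $p^B_\theta(\sigma_i, \sigma_j)$ be the probability of seeing two 
consecutive states $(\sigma_i, \sigma_j)$. We refer to the matrix 
$\mathbf{P}^B_\theta = \{p^B_\theta(\sigma_i, \sigma_j)\}_{i,j=1}^{|\Sigma|}$ as the model-based PL associated with 
$\theta \in \Omega$. Then, the family of model-based PLs 
$\mathscr{P}^B = \{\mathbf{P}^B_\theta : \theta \in \Omega\}$ characterizes the stochastic process $\mathscr{G}$. 

To characterize the observation $\mathcal{G}$, let 

$$\mathcal{E}^{\mathcal{G}}_B(\sigma_i, \sigma_j) = \frac{1}{n} \sum_{l=2}^{n} \mathbf{1}(g^{l-1} = \sigma_i,\ g^l = \sigma_j), \qquad i, j = 1, \dots, |\Sigma|. \qquad (3)$$

We define the model-based empirical measure of $\mathcal{G}$ as the 
matrix $\mathcal{E}^{\mathcal{G}}_B = \{\mathcal{E}^{\mathcal{G}}_B(\sigma_i, \sigma_j)\}_{i,j=1}^{|\Sigma|}$. The transition probability 
from $\sigma_i$ to $\sigma_j$ is simply 
$\mathcal{E}^{\mathcal{G}}_B(\sigma_j \mid \sigma_i) = \mathcal{E}^{\mathcal{G}}_B(\sigma_i, \sigma_j) / \sum_{j=1}^{|\Sigma|} \mathcal{E}^{\mathcal{G}}_B(\sigma_i, \sigma_j)$. 

Suppose $\Pi = \{\pi(\sigma_i, \sigma_j)\}_{i,j=1}^{|\Sigma|}$ is a model-based PL and 
$\mathbf{Q} = \{q(\sigma_i, \sigma_j)\}_{i,j=1}^{|\Sigma|}$ is a model-based empirical measure. 
Let $\hat{\pi}(\sigma_j \mid \sigma_i)$ and $\hat{q}(\sigma_j \mid \sigma_i)$ be the corresponding transition 
probabilities from $\sigma_i$ to $\sigma_j$. Then, the model-based 
divergence between $\Pi$ and $\mathbf{Q}$ is 

$$D_B(\mathbf{Q} \| \Pi) = \sum_{i=1}^{|\Sigma|} \sum_{j=1}^{|\Sigma|} \hat{q}(\sigma_i, \sigma_j) \log \frac{\hat{q}(\sigma_j \mid \sigma_i)}{\hat{\pi}(\sigma_j \mid \sigma_i)}, \qquad (4)$$

where $\hat{q}(\sigma_i, \sigma_j) = \max(q(\sigma_i, \sigma_j), \varepsilon)$ and 
$\hat{\pi}(\sigma_i, \sigma_j) = \max(\pi(\sigma_i, \sigma_j), \varepsilon)$ for some small positive constant $\varepsilon$ 
introduced to avoid underflow and division by zero. Similar to 
the model-free case, we present the following definition: 

Definition 2 

(Model-Based Generalized Hoeffding Test). The model-based 
generalized Hoeffding test is to reject $\mathcal{H}_0$ when $\mathcal{G}$ is in 

$$S^*_B = \{\mathcal{G} \mid \inf_{\theta \in \Omega} D_B(\mathcal{E}^{\mathcal{G}}_B \| \mathbf{P}^B_\theta) > \lambda\},$$

where $\lambda$ is a detection threshold and $\inf_{\theta \in \Omega} D_B(\mathcal{E}^{\mathcal{G}}_B \| \mathbf{P}^B_\theta)$ 
is referred to as the generalized model-based divergence 
between $\mathcal{E}^{\mathcal{G}}_B$ and $\mathscr{P}^B = \{\mathbf{P}^B_\theta : \theta \in \Omega\}$. 

In this case as well, asymptotic (generalized) Neyman-Pearson 
optimality can be established. 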
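
To make the generalized Hoeffding test concrete, the sketch below implements the model-based empirical measure of Eq. (3), the divergence of Eq. (4), and the test itself for a finite family of PLs, in which case the infimum becomes a minimum. All function names, the value of `eps`, and the example data are assumptions made for illustration.

```python
import math

def model_based_empirical_measure(observations, alphabet):
    """Eq. (3): frequency of each consecutive pair, normalized by n."""
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    size, n = len(alphabet), len(observations)
    measure = [[0.0] * size for _ in range(size)]
    for prev, cur in zip(observations, observations[1:]):
        measure[index[prev]][index[cur]] += 1.0 / n
    return measure

def model_based_divergence(q, pi, eps=1e-10):
    """Eq. (4), with every joint probability floored at eps (assumed value)."""
    q_hat = [[max(v, eps) for v in row] for row in q]
    pi_hat = [[max(v, eps) for v in row] for row in pi]
    total = 0.0
    for q_row, pi_row in zip(q_hat, pi_hat):
        q_sum, pi_sum = sum(q_row), sum(pi_row)
        for q_ij, pi_ij in zip(q_row, pi_row):
            # Transition probabilities are joint probabilities over row sums.
            total += q_ij * math.log((q_ij / q_sum) / (pi_ij / pi_sum))
    return total

def generalized_hoeffding_test(measure, family, divergence, threshold):
    """Return (alarm, generalized divergence) for a finite family of PLs."""
    generalized = min(divergence(measure, pl) for pl in family)
    return generalized > threshold, generalized

alphabet = ["a", "b"]                      # hypothetical alphabet
window = ["a", "b", "a", "b"]              # hypothetical quantized flows
alternating = model_based_empirical_measure(window, alphabet)
sticky = [[0.5, 0.0], [0.0, 0.5]]          # hypothetical PL: the state never changes
print(generalized_hoeffding_test(alternating, [sticky], model_based_divergence, 0.4))
print(generalized_hoeffding_test(alternating, [sticky, alternating], model_based_divergence, 0.4))
```

The same `generalized_hoeffding_test` helper works for the model-free case when it is given a model-free divergence function, because only the minimum over the family matters.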

III. Network anomaly detection 

Fig. 1 outlines the structure of our robust anomaly detection 
methods. We first propose our feature set (Sec. III-A). 
We assume that the normal traffic is governed by an 
underlying stochastic process $\mathscr{G}$. We assume the size of 
model-free and model-based PL families to be finite and 
propose a two-step procedure to estimate PLs from some 
reference data. We first inspect each feature separately to 
generate a family of candidate PLs (Sec. III-C), which is then 
reduced to a smaller family of PLs (Sec. III-D). For each 
window, the algorithm applies the model-free and model-based 
generalized Hoeffding test discussed above. 

Fig. 1. Structure of the algorithms. 

A. Data representation 

In this paper, we focus on host-based anomaly detection, 
a specific application in which we monitor the incoming and 
outgoing packets of a server. We assume that the server 
provides only one service (e.g., an HTTP server) and other 
ports are either closed or outside our interest. As a result, 
we only monitor traffic on a certain port (e.g., port 80 for 
HTTP service). For servers with multiple ports in need of 
monitoring, we can simply run our methods on each port. 

The features we propose for this particular application 
relate to a flow representation slightly different from that 
of commercial vendors like Cisco NetFlow [14]. Hereafter, 
we will use “flows”, “traffic”, and “data” interchangeably. 
Let $\mathcal{S} = \{s^1, \dots, s^{|\mathcal{S}|}\}$ denote the collection of all packets 
collected on a certain port of the monitored host. In 
host-based anomaly detection, the server IP is always fixed 
and thus ignored. Denote the user IP address in packet $s^i$ as 
$\mathbf{x}^i$, whose format will be discussed later. The size of $s^i$ is 
$b^i \in [0, \infty)$ in bytes and the start time of transmission is 
$t^i_s \in [0, \infty)$ in seconds. Using this convention, packet $s^i$ can 
be represented as $(\mathbf{x}^i, b^i, t^i_s)$ for all $i = 1, \dots, |\mathcal{S}|$. 

We compile a sequence of packets $s^1, \dots, s^m$ with 
$t^1_s < \dots < t^m_s$ into a flow $\mathbf{f} = (\mathbf{x}, b, d_t, t)$ if $\mathbf{x} = \mathbf{x}^1 = \dots = \mathbf{x}^m$ 
and $t^i_s - t^{i-1}_s < \delta_F$ for $i = 2, \dots, m$ and some prescribed 
$\delta_F \in (0, \infty)$. Here, the flow size $b$ is the sum of the sizes 
of the packets that comprise the flow. The flow duration is 
$d_t = t^m_s - t^1_s$. The flow transmission time $t$ equals the start 
time of the first packet of the flow, $t^1_s$. In this way, we can 
translate the large collection of packets $\mathcal{S}$ into a relatively 
small collection of flows $\mathcal{F}$. 

Suppose $\mathcal{X}$ is the set of unique IP addresses in $\mathcal{F}$. Viewing 
each IP as a tuple of integers, we apply typical $K$-means 
clustering on $\mathcal{X}$. For each $\mathbf{x} \in \mathcal{X}$, we thus obtain a cluster 
label $k(\mathbf{x})$. Suppose the cluster center for cluster $k$ is $\bar{\mathbf{x}}^k$; 
then the distance of $\mathbf{x}$ to the corresponding cluster center is 
$d_a(\mathbf{x}) = d(\mathbf{x}, \bar{\mathbf{x}}^{k(\mathbf{x})})$, for some appropriate distance metric. 
The cluster label $k(\mathbf{x})$ and distance to cluster center $d_a(\mathbf{x})$ 
are used to identify a user IP address $\mathbf{x}$, leading to our final 
representation of a flow as: 

$$\mathbf{f} = (k(\mathbf{x}), d_a(\mathbf{x}), b, d_t, t). \qquad (5)$$

For each $\mathbf{f}$, we quantize $d_a(\mathbf{x})$, $b$, and $d_t$ to discrete values. 
Each tuple $(k(\mathbf{x}), d_a(\mathbf{x}), b, d_t)$ corresponds to a symbol 
in $\Sigma = \{1, \dots, K\} \times \Sigma_{d_a} \times \Sigma_b \times \Sigma_{d_t}$, where $\Sigma_{d_a}$, $\Sigma_b$, and 
$\Sigma_{d_t}$ are the quantization alphabets for distance to cluster 
center, flow size, and flow duration, respectively. Denoting 
by $g$ the corresponding quantized symbol of $\mathbf{f}$ and by $\mathcal{G}$ the 
counterpart of $\mathcal{F}$, we number the symbols in $g$ corresponding 
to $k(\mathbf{x})$, $d_a(\mathbf{x})$, $b$, and $d_t$ as features 1, 2, 3, 4. 

In our methods, flows in $\mathcal{F}$ are further aggregated into 
windows based on their flow transmission times. A window 
is a detection unit that consists of flows in a continuous 
time range, i.e., the flows in the same window are evaluated 
together. Let $h$ be the interval between the start points of 
two consecutive time windows and $w_s$ be the window size. 
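
The packet-to-flow compilation described in this subsection can be sketched in Python as follows. The sketch assumes that packets are grouped per user IP before the gap rule is applied, so packets from other users interleaved in time do not break a flow; the text does not spell this out. The `Packet` and `Flow` tuples and the sample values are hypothetical.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "ip size start")          # (user IP, bytes, start time in s)
Flow = namedtuple("Flow", "ip size duration start")     # (user IP, b, d_t, t)

def _close(run):
    """Summarize a run of packets from one user as a single flow."""
    return Flow(run[0].ip,
                sum(p.size for p in run),
                run[-1].start - run[0].start,
                run[0].start)

def compile_flows(packets, delta_f):
    """Merge same-IP packets whose consecutive start times differ by less than delta_f."""
    flows, open_runs = [], {}
    for packet in sorted(packets, key=lambda p: p.start):
        run = open_runs.get(packet.ip)
        if run and packet.start - run[-1].start < delta_f:
            run.append(packet)
        else:
            if run:
                flows.append(_close(run))   # gap too large: the previous flow ends
            open_runs[packet.ip] = [packet]
    flows.extend(_close(run) for run in open_runs.values())
    return sorted(flows, key=lambda f: f.start)

packets = [Packet("10.0.0.1", 100, 0.0), Packet("10.0.0.1", 200, 0.5),
           Packet("10.0.0.2", 50, 0.6), Packet("10.0.0.1", 300, 5.0)]
for flow in compile_flows(packets, delta_f=1.0):
    print(flow)
```

Clustering of the user IPs and quantization of the remaining fields would then be applied to the resulting flows to obtain the quantized symbols.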


B. Anomaly detection for dynamic networks 

For each window $j$, an empirical measure of $\mathcal{G}_j$ is calculated. 
We then leverage the model-free and the model-based 
generalized Hoeffding test (Defs. 1 and 2), which require a set of 
PLs $\{\mathbf{p}^F_\theta : \theta \in \Omega\}$ and $\{\mathbf{P}^B_\theta : \theta \in \Omega\}$. We assume $|\Omega|$ to 
be finite, and divide our reference traffic $\mathcal{G}_{\mathrm{ref}}$ into segments; 
the traffic of each segment is governed by the same PL. The 
empirical measure of each segment is then a PL. 
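
The per-window evaluation relies on the window aggregation of Sec. III-A. A minimal sketch is shown below; it assumes half-open windows that start at the first flow transmission time, which is one possible convention rather than something fixed by the text.

```python
def aggregate_windows(flow_times, window_size, interval):
    """Group flow indices into windows [start, start + window_size).

    Consecutive window starts are `interval` apart; the first window starts
    at the earliest flow transmission time (an assumption of this sketch).
    """
    if not flow_times:
        return []
    start, last = min(flow_times), max(flow_times)
    windows = []
    while start <= last:
        windows.append([i for i, t in enumerate(flow_times)
                        if start <= t < start + window_size])
        start += interval
    return windows

times = [0, 500, 1999, 2000, 4100]    # hypothetical flow transmission times (s)
print(aggregate_windows(times, window_size=2000, interval=2000))
```

Choosing an interval smaller than the window size yields overlapping windows, so a flow may then be evaluated more than once.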

Two flows are likely to be governed by the same PL if 
they have close flow transmission times. In addition, if the 
properties of the normal traffic change periodically, two 
flows are also likely to be governed by the same PL when the 
difference of their flow transmission times is close to the 
period. Let $t_p$ be the period and let $t_d$ be a window size 
characterizing the speed of change for the normal pattern. We 
could divide each period into $\lfloor t_p / t_d \rfloor$ segments with length 
$t_d$, and combine corresponding segments of different periods 
together, resulting in $\lfloor t_p / t_d \rfloor$ PLs. 
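
The segmentation idea can be sketched as follows: each flow is mapped to its phase within the period, flows with the same segment index are pooled across periods, and the empirical measure of each pool becomes a candidate model-free PL. Folding any leftover part of the period into the last segment is an assumption of this sketch, as are the function name and the sample data.

```python
from collections import Counter, defaultdict

def periodic_segment_pls(times, symbols, alphabet, t_p, t_d):
    """Return {segment index: model-free PL} using floor(t_p / t_d) segments."""
    n_segments = int(t_p // t_d)
    pooled = defaultdict(list)
    for t, g in zip(times, symbols):
        # Phase within the period decides the segment; leftovers go to the last one.
        segment = min(int((t % t_p) // t_d), n_segments - 1)
        pooled[segment].append(g)
    pls = {}
    for segment, pooled_symbols in pooled.items():
        counts = Counter(pooled_symbols)
        pls[segment] = [counts[s] / len(pooled_symbols) for s in alphabet]
    return pls

# Hypothetical data: period 24 h, two segments of 12 h, two periods observed.
times = [1, 13, 25, 37]
symbols = ["lo", "hi", "lo", "hi"]
print(periodic_segment_pls(times, symbols, ["lo", "hi"], t_p=24, t_d=12))
```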

In practical networks, the period may vary with time, 
which makes it hard to estimate $t_p$ and $t_d$ accurately. To 
increase the robustness of the set of estimated PLs to these 
non-stationarities, we first propose a large collection of 
candidates (Sec. III-C) and then refine it (Sec. III-D). 

C. Estimation of $t_d$ and $t_p$ 

This section presents a procedure to estimate $t_d$ and $t_p$ by 
inspecting each feature separately. Recall that each quantized 
flow consists of quantized values of a cluster label, a distance 
to cluster center, a flow size and a flow duration, which are 
called features 1, ..., 4, respectively. We say a quantized flow 
$g$ belongs to channel $a$-$b$ if feature $a$ of $g$ equals symbol $b$ 
in the quantization alphabet of feature $a$. We first analyze each 
channel separately to get a rough estimate of $t_d$ and $t_p$. Then, 
channels corresponding to the same feature are averaged to 
generate a combined estimate. 

For all flows in channel $a$-$b$, we calculate the intervals 
between two consecutive flows. Most of the intervals will be 
very small. If we divide the interval length into several bins 
and calculate the histogram, i.e., the number of observed 
intervals in each bin, the histogram is heavily skewed toward 
small interval lengths. $t_d$ could be chosen to be the interval 
length of the first bin (i.e., the one with the smallest interval 
length) whose frequency in the histogram is less than a 
threshold. In addition, there may be some large intervals if 
the feature is periodic. Fig. 2 shows an example of a feature 
that exhibits periodicity. There will be two peaks around $t_{p1}$ 
and $t_{p2}$ in the histogram of intervals for flows whose values 
are between the two dashed lines. We can select $t_p$ such that 
$(t_{p1} + t_{p2})/2 \approx t_p/2$. There can be a single peak or more than 
two peaks due to noise in the network; in either case, we 
choose the average of all peaks as an estimate of $t_p/2$. 

Fig. 2. Illustration of the peaks in periodic networks. 

If no channel of a feature $a$ reports $t_p$, the network is 
non-periodic according to the feature $a$. Otherwise, the estimate 
of $t_p$ for a feature $a$ (denoted by $\hat{t}^a_p$) is simply the average 
of all estimates for channels of the feature $a$. Although the 
estimate of only one channel is usually very inaccurate, the 
averaging procedure helps improve the accuracy. Similarly, 
the estimate for $t_d$ for a feature $a$ (denoted by $\hat{t}^a_d$) is the 
average of the estimates for all channels of the feature $a$. 

For each feature $a$, we generate some PLs using the 
estimates $\hat{t}^a_d$ and $\hat{t}^a_p$. In case some prior knowledge of $t_d$ 
and $t_p$ is available, the family of candidate PLs can include 
the PLs calculated based on this prior knowledge. 
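
The per-channel estimation above can be sketched in Python. Several details are assumptions of this sketch and are not fixed by the text: the estimate of $t_d$ is taken as the lower edge of the first sparse bin, a "peak" is any bin at or beyond $t_d$ whose count reaches `peak_min_count`, and peak locations are bin centers.

```python
def interval_histogram(times, bin_width):
    """Histogram of the gaps between consecutive flows of one channel."""
    ordered = sorted(times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    counts = [0] * (int(max(gaps) // bin_width) + 1)
    for gap in gaps:
        counts[int(gap // bin_width)] += 1
    return counts

def estimate_td(counts, bin_width, threshold):
    """Lower edge of the first bin whose count falls below the threshold."""
    for k, count in enumerate(counts):
        if count < threshold:
            return k * bin_width
    return len(counts) * bin_width   # no sparse bin: fall back to the full range

def estimate_tp(counts, bin_width, t_d, peak_min_count):
    """Twice the average peak location, or None if the channel shows no peak."""
    peaks = [(k + 0.5) * bin_width for k, count in enumerate(counts)
             if k * bin_width >= t_d and count >= peak_min_count]
    return 2.0 * sum(peaks) / len(peaks) if peaks else None

# Hypothetical channel: short bursts at phases 0 and 10 of a period of 30.
times = [base + 0.1 * k for base in (0, 10, 30, 40, 60, 70) for k in range(5)]
counts = interval_histogram(times, bin_width=1.0)
t_d = estimate_td(counts, bin_width=1.0, threshold=2)
print(t_d, estimate_tp(counts, 1.0, t_d, peak_min_count=2))
```

Averaging such per-channel estimates over all channels of a feature, as described above, would then give the combined estimate for that feature.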

D. PL refinement with integer programming 

The larger the family of PLs we use in generalized 
hypothesis testing, the more likely we will overfit $\mathcal{G}_{\mathrm{ref}}$, 
leading to poor results. Furthermore, a smaller family of 
PLs reduces the computational cost. This section introduces 
a method to refine the family of candidate PLs. 

For simplicity, we only describe the procedure for the 
model-free method. The procedure for the model-based 
method is similar. Hereafter, the divergence between a 
collection of flows and a PL is equivalent to the divergence 
between the empirical measure of these flows and the PL. 

Suppose the family (namely the set) of candidate PLs is 
the set $\mathcal{P} = \{\mathbf{p}^F_1, \dots, \mathbf{p}^F_N\}$ of cardinality $N$. Because no 
alarm should be reported for $\mathcal{G}_{\mathrm{ref}}$, or any segment of $\mathcal{G}_{\mathrm{ref}}$, 
our primary objective is to choose the smallest set $\mathscr{P}^F \subseteq \mathcal{P}$ 
such that there is no alarm for $\mathcal{G}_{\mathrm{ref}}$. We aggregate $\mathcal{G}_{\mathrm{ref}}$ into 
$M$ windows using the techniques of Sec. III-A and denote 
the data in window $i$ as $\mathcal{G}^i_{\mathrm{ref}}$. Let $D_{ij} = D_F(\mathcal{E}^{\mathcal{G}^i_{\mathrm{ref}}}_F \| \mathbf{p}^F_j)$ 
be the divergence between flows in window $i$ and PL $j$ for 
$i = 1, \dots, M$ and $j = 1, \dots, N$. We say window $i$ is covered 
(namely, reported as normal) by PL $j$ if $D_{ij} \le \lambda$. With 
this definition, the primary objective becomes to select the 
minimum number of PLs to cover all the windows. 

There may be more than one subset of $\mathcal{P}$ having the 
same cardinality and covering all windows. We propose a 
secondary objective characterizing the variation of a set of 
PLs. Denote by $\mathcal{T}_j$ the set of intervals between consecutive 
windows covered by PL $j$. The coefficient of variation for PL 
$j$ is defined as $c^j_v = \mathrm{Std}(\mathcal{T}_j) / \mathrm{Mean}(\mathcal{T}_j)$, where $\mathrm{Std}(\mathcal{T}_j)$ 
and $\mathrm{Mean}(\mathcal{T}_j)$ are the sample standard deviation and mean 
of the set $\mathcal{T}_j$, respectively. A smaller coefficient of variation 
means that the PL is more “regular.” 
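
A short Python sketch of the coefficient of variation follows. It measures intervals in window indices and returns 0 when fewer than two intervals are available; both choices are assumptions of the sketch.

```python
import statistics

def coefficient_of_variation(covered_windows):
    """Std/Mean of the gaps between consecutive windows covered by one PL."""
    ordered = sorted(covered_windows)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return 0.0   # sample standard deviation undefined; treated as regular
    return statistics.stdev(gaps) / statistics.mean(gaps)

print(coefficient_of_variation([0, 2, 4, 6, 8]))   # evenly spaced windows
print(coefficient_of_variation([0, 1, 5, 6]))      # irregularly spaced windows
```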

We formulate PL refinement as a weighted set cover 
problem in which the weight of PL $j$ is $1 + \gamma c^j_v$, where 
$\gamma$ is a small weight for the secondary objective. Let $x_j$ 
be the 0-1 variable indicating whether PL $j$ is selected or 
not; let $\mathbf{x} = (x_1, \dots, x_N)$. Let $\mathbf{A} = \{a_{ij}\}$ be an $M \times N$ 
matrix whose $(i, j)$th element is set to 1 if $D_{ij} \le \lambda$ and 
to 0 otherwise. Here, $\lambda$ is the same threshold we used in 
Def. 1. Let $\mathbf{c}_v = (c^1_v, \dots, c^N_v)$. The selection of PLs can be 
formulated as the following integer programming problem: 

$$\begin{aligned} \min\ & \mathbf{1}'\mathbf{x} + \gamma \mathbf{c}_v'\mathbf{x} \\ \text{s.t.}\ & \mathbf{A}\mathbf{x} \ge \mathbf{1}, \\ & x_j \in \{0, 1\}, \quad j = 1, \dots, N, \end{aligned} \qquad (6)$$

where $\mathbf{1}$ is a vector of ones. The cost function equals a 
weighted sum of the primary cost $\mathbf{1}'\mathbf{x}$ and the secondary 
cost $\mathbf{c}_v'\mathbf{x}$. The first constraint enforces that there is no alarm for 
$\mathcal{G}^i_{\mathrm{ref}}$ for all $i$. 

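function HeuristicRefinePl($\mathbf{A}$, $\mathbf{c}_v$, $r$, $\gamma_{th}$) 
    Init: bestCost := $\infty$, $\gamma$ := 1, $\mathbf{x}^*$ := $\mathbf{0}$ 
    while $\gamma > \gamma_{th}$ do 
        $\mathbf{x}$ := GreedySetCover($\mathbf{A}$, $\gamma$, $\mathbf{c}_v$), $\gamma$ := $r\gamma$ 
        if $\mathbf{1}'\mathbf{x} + \gamma_{th}\mathbf{c}_v'\mathbf{x}$ < bestCost then 
            bestCost := $\mathbf{1}'\mathbf{x} + \gamma_{th}\mathbf{c}_v'\mathbf{x}$ 
            $\mathbf{x}^*$ := $\mathbf{x}$ 
        end if 
    end while 
    return $\mathbf{x}^*$ 
end function 

function GreedySetCover($\mathbf{A}$, $\gamma$, $\mathbf{c}_v$) 
    Init: $\mathbf{x}$ := $\mathbf{0}$, $C$ := $\emptyset$ 
    while $|C| < M$ do 
        $j^+$ := $\arg\max_{j : x[j] = 0}\ |\{i \notin C : a_{ij} = 1\}| / (1 + \gamma c^j_v)$ 
        $x[j^+]$ := 1, $C$ := $C \cup \{i : a_{ij^+} = 1\}$ 
    end while 
    return $\mathbf{x}$ 
end function 

Algorithm 1: Greedy algorithm for PL refinement. 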

Because (6) is NP-hard, we propose a heuristic algorithm 
to solve it (Algorithm 1). HeuristicRefinePl is the main 
procedure whose parameters are $\mathbf{A}$, $\mathbf{c}_v$, a discount ratio 
$r < 1$, and a termination threshold $\gamma_{th}$. In each iteration, 
the algorithm decreases $\gamma$ by a ratio $r$ and calls the 
GreedySetCover procedure to solve (6). The algorithm 
terminates when $\gamma \le \gamma_{th}$. In the initial iterations, the weight 
$\gamma$ for the secondary cost is large so that the algorithm 
explores solutions which select PLs with less variation. Later, 
the weight $\gamma$ decreases to ensure that the primary objective 
plays the main role. Parameters $\gamma_{th}$ and $r$ determine the 
algorithm’s degree of exploration, which helps avoid local 
minima. In practice, one can choose a small $\gamma_{th}$ and a large $r$ 
if enough computational power is available. 

GreedySetCover uses the ratio of the number of 
uncovered windows a PL can cover to the cost $1 + \gamma c_v$ 
as its heuristic, where $c_v$ is the corresponding coefficient of 
variation. GreedySetCover adds the PL with the 
maximum heuristic value to $\mathscr{P}^F$ until all windows are 
covered by the PLs in $\mathscr{P}^F$. Suppose the return value of 
HeuristicRefinePl is $\mathbf{x}^*$. Then, the refined family of PLs 
is $\mathscr{P}^F = \{\mathbf{p}^F_j : x^*_j > 0,\ j = 1, \dots, N\}$. 

Fig. 3. Results of PL refinement for the model-free method in a network 
with diurnal pattern. All figures share the x-axis. (A) and (B) plot the 
divergence of traffic in each window with all candidate PLs and with selected 
PLs, respectively. (C) shows the active PL for each window. (D) plots the 
generalized divergence of traffic in each window with all candidate PLs and 
selected PLs. 
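
The following Python sketch mirrors the two procedures of the refinement heuristic. It is an illustrative reading of the description above rather than the authors' implementation; the coverage matrix, the coefficients of variation, and the parameter values in the example are hypothetical.

```python
def greedy_set_cover(A, gamma, c_v):
    """Greedy weighted set cover; A[i][j] == 1 if PL j covers window i."""
    n_windows, n_pls = len(A), len(A[0])
    x, covered = [0] * n_pls, set()
    while len(covered) < n_windows:
        best_j, best_score = None, 0.0
        for j in range(n_pls):
            if x[j]:
                continue
            gain = sum(1 for i in range(n_windows) if A[i][j] and i not in covered)
            score = gain / (1.0 + gamma * c_v[j])   # newly covered windows per unit cost
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            raise ValueError("some window is not covered by any candidate PL")
        x[best_j] = 1
        covered |= {i for i in range(n_windows) if A[i][best_j]}
    return x

def heuristic_refine_pl(A, c_v, r, gamma_th):
    """Rerun the greedy cover with a shrinking gamma and keep the cheapest result."""
    best_cost, gamma, x_best = float("inf"), 1.0, None
    while gamma > gamma_th:
        x = greedy_set_cover(A, gamma, c_v)
        gamma *= r
        cost = sum(x) + gamma_th * sum(c * x_j for c, x_j in zip(c_v, x))
        if cost < best_cost:
            best_cost, x_best = cost, x
    return x_best

# Hypothetical instance: 4 windows, 3 candidate PLs; PL 2 is very irregular.
A = [[1, 0, 1],
     [1, 0, 1],
     [0, 1, 1],
     [0, 1, 0]]
print(heuristic_refine_pl(A, c_v=[0.0, 0.0, 5.0], r=0.5, gamma_th=0.01))
```

In this example both [1, 1, 0] and [0, 1, 1] cover all windows with two PLs; the secondary cost breaks the tie in favor of the more regular PLs.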

IV. Simulation results 

The lack of data with annotated anomalies is a common 
problem for validation of network anomaly detection methods. We 
developed an open source software package SADIT [15] to 
provide flow-level datasets with annotated anomalies. Based 
on the fs-simulator [16], SADIT simulates the normal and 
abnormal flows in networks efficiently. 

Our simulated network consists of an internal network 
and several Internet nodes. The internal network consists 
of 8 normal nodes CT1-CT8 and 1 server SRV containing 
some sensitive information. There are also three Internet 
nodes INT1-INT3 that access the internal network through 
a gateway (GATEWAY). For all links, the link capacity is 10 
Mb/s and the delay is 0.01 s. All internal and Internet nodes 
communicate with the SRV and there is no communication 
between other nodes. The normal flows from all nodes to SRV 
have the same characteristics. The size of the normal flows 
follows a Gaussian distribution $N(m(t), \sigma^2)$. The arrival 
process of flows is a Poisson process with arrival rate $\lambda(t)$. 
Both $m(t)$ and $\lambda(t)$ change with time $t$. 
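
For intuition, normal traffic of this kind can be generated with a time-varying Poisson process simulated by thinning. The sketch below is not SADIT code: the sinusoidal `diurnal_profile` is a hypothetical stand-in for the normalized traffic profile, and the function names and parameter values are assumptions made for the example.

```python
import math
import random

def diurnal_profile(t_hours):
    """Hypothetical normalized profile in [0, 1] with a 24-hour period."""
    return 0.5 * (1.0 + math.sin(2.0 * math.pi * t_hours / 24.0))

def simulate_flows(peak_rate, peak_mean_size, sigma2, horizon_s, rng):
    """Return (arrival time in s, flow size) pairs for one user.

    Candidate arrivals are drawn at the peak rate and each is kept with
    probability equal to the profile, which yields a Poisson process whose
    rate is peak_rate * profile(t).
    """
    flows, t = [], 0.0
    while True:
        t += rng.expovariate(peak_rate)
        if t >= horizon_s:
            return flows
        p = diurnal_profile(t / 3600.0)
        if rng.random() < p:
            size = rng.gauss(peak_mean_size * p, math.sqrt(sigma2))
            flows.append((t, size))

flows = simulate_flows(peak_rate=0.1, peak_mean_size=4.0, sigma2=0.01,
                       horizon_s=24 * 3600.0, rng=random.Random(0))
print(len(flows))
```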

We assume the flow arrival rate and the mean flow size 
have the same diurnal pattern. Let $p(t)$ be the normalized 
average traffic to American social websites [17], which varies 
diurnally, and assume $\lambda(t) = \Lambda p(t)$ and $m(t) = M_p p(t)$, 
where $\Lambda$ and $M_p$ are the peak arrival rate and the peak mean 
flow size. In our simulation, we set $M_p = 4$ Mb, $\sigma^2 = 0.01$, 
and $\Lambda = 0.1$ fps (flows per second) for all users. Using this 
diurnal pattern, we generate reference traffic $\mathcal{G}_{\mathrm{ref}}$ for one 
week (168 hours) whose start time is 5 pm. For window 
aggregation, both the window size $w_s$ and the interval $h$ 
between two consecutive windows are 2,000 s. The number 
of user clusters is $K = 2$. The numbers of quantization levels 
for features 2, 3, and 4 are 2, 2, and 8, respectively. An estimation 
procedure is applied to estimate $t_d$ and $t_p$. The estimate of the period 
based on flow size is $t_p = 24.56$ h, an error of only 2.3%. 


Fig. 4. Results of PL refinement for the model-based method in a network 
with diurnal pattern. All figures share the x-axis. (A) and (B) plot the 
divergence of traffic in each window with all candidate PLs and with selected 
PLs, respectively. (C) shows the active PL for each window. (D) plots the 
generalized divergence of traffic in each window with all candidate PLs and 
selected PLs. 

A. PL refinement 

For the model-free method, there are 64 candidate model-free 
PLs. The model-free divergence between each window 
and each candidate PL is a periodic function of time, too. 
Some PLs have smaller divergence during the day and some 
others have smaller divergence during the night (cf. Fig. 3A). 
However, no PL has small divergence for all windows. Three 
PLs out of the 64 candidates are selected when the detection 
threshold is $\lambda = 0.6$ (cf. Fig. 3B). The three selected PLs are 
active during day, night, and the transitional time, respectively 
(cf. Fig. 3C for the active PLs of all windows). For 
all windows, the model-free generalized divergence between 
$\mathcal{G}_{\mathrm{ref}}$ and all candidate PLs is very close to the divergence 
between $\mathcal{G}_{\mathrm{ref}}$ and only the selected PLs (Fig. 3D). The 
difference is relatively larger during the transitional time 
between day and night. This is because the network is more 
dynamic during this transitional time; thus, more PLs are 
required to represent the network accurately. 

For the model-based method, there are 64 candidate 
model-based PLs, too. Similar to the model-free method, the 
model-based divergence between all candidate PLs and flows 
in each window in $\mathcal{G}_{\mathrm{ref}}$ is periodic (Fig. 4A) and there is no 
PL that can represent all the reference data $\mathcal{G}_{\mathrm{ref}}$. Two PLs are 
selected when $\lambda = 0.4$ (Fig. 4B). One PL is active during the 
transitional time and the other is active during the stationary 
time, which consists of both day and night (Fig. 4C). As 
before, the divergence between each $\mathcal{G}^i_{\mathrm{ref}}$ and all candidate 
PLs is similar to the divergence between $\mathcal{G}^i_{\mathrm{ref}}$ and just the 
selected PLs (Fig. 4D). 

The results show that the PL refinement procedure is effective 
and the refined family of PLs is meaningful. Each PL in 
the refined family of the model-free method corresponds to a 
“pattern of normal behavior,” whereas each PL in the refined 
family of the model-based method describes the transition 
among the “patterns.” This information is useful not only 
for anomaly detection but also for understanding the normal 
traffic in dynamic networks. 

Fig. 5. Comparison of vanilla and robust methods. (A), (B) show detection 
results of vanilla and robust model-free methods and (C), (D) show detection 
results of vanilla and robust model-based methods. The horizontal lines 
indicate the detection threshold. 


B. Comparison with vanilla stochastic methods 

We compared the performance of our robust model-free 
and model-based methods with their vanilla counterparts ([5], 
[18]) in detecting anomalies. In the vanilla methods, all 
reference traffic $\mathcal{G}_{\mathrm{ref}}$ is used to estimate a single PL. We 
used all methods to monitor the server SRV for one week 
(168 hours). 

We considered an anomaly in which node CT2 increases 
the mean flow size by 30% at 59 h and the increase lasts 
for 80 minutes before the mean returns to its normal value. 
This type of anomaly could be associated with a situation 
in which attackers try to exfiltrate sensitive information (e.g., 
user accounts and passwords) through SQL injection [19]. 

For all methods, the window size is $w_s = 2000$ s and the 
interval is $h = 2000$ s. The quantization parameters are equal 
to those in the procedure for analyzing the reference traffic 
$\mathcal{G}_{\mathrm{ref}}$. The simulation results show that the robust model-free 
and model-based methods perform better than their vanilla 
counterparts for both types of normal traffic patterns (Fig. 5). 

The diurnal pattern has a large influence on the results of the 
vanilla methods. For both the vanilla and the robust model-free 
methods, the detection threshold $\lambda$ equals 0.6. The 
vanilla model-free method reports all night traffic (from 
3 am to 11 am) as anomalies (Fig. 5A). The reason is that 
the night traffic is lighter than the day traffic, so the PL 
calculated using all of $\mathcal{G}_{\mathrm{ref}}$ is dominated by the day pattern, 
whereas the night pattern is underrepresented. In contrast, 
because both the day and the night patterns are represented in 
the refined family of PLs (Fig. 3), the robust model-free 
method is not influenced by the fluctuation of normal traffic 
and successfully detects the anomaly (Fig. 5B). 

The diurnal pattern has similar effects on the model-based 
methods. When the detection threshold $\lambda$ equals 0.4, 
the anomaly is barely detectable using the vanilla model-based 
method (Fig. 5C). Similar to the vanilla model-free 
method, the divergence is higher during the transitional time 
because the transition pattern is underrepresented in the PL 
calculated using all of $\mathcal{G}_{\mathrm{ref}}$. Again, the robust model-based 
method is superior because both the transition pattern and 
the stationary pattern are well represented in the refined 
family of PLs (Fig. 5D). 

V. Conclusions 

The statistical properties of normal traffic are time-varying 
for many networks. We propose a robust model-free and a 
robust model-based method to perform host-based anomaly 
detection in those networks. Our methods can generate a 
more complete representation of the normal traffic and are 
robust to the non-stationarity in networks. 